If you're a foodie like the members of LT7, you know the struggle of deciding what to cook next. With thousands of recipes available online, choosing just one can be overwhelming. That's where a personalized recommendation system comes in handy, and that's precisely what this project aims to achieve for dessert lovers on food.com!
The pandemic has led to a surge in home cooking, and food.com has become a go-to destination for many recipe seekers. However, the platform lacks a recommendation system to help users navigate through the vast collection of recipes available. This project addresses that gap by creating a recommender system based on user preferences.
Using a Kaggle dataset of scraped recipes and user interactions from food.com, the team implemented a comprehensive data science pipeline to generate personalized recommendations for clustered users and items. The dataset was pre-processed to remove duplicates and filter out unnecessary data, and dimensionality reduction techniques were applied to manage computational efficiency and memory constraints. The team used clustering techniques to identify user and item clusters, and various collaborative filtering methods were used to generate recommendations based on user ratings.
The study found that the most effective approach was latent-factor-based collaborative filtering, which provided personalized recommendations with high coverage, a balanced novelty score, and a high intra-list similarity score. This means users can get recommendations tailored to their preferences while still exploring new options. The study also recommends further improvements: applying other clustering techniques and algorithms, content-based collaborative filtering, and exploring other food categories.
With this project, food.com users can enjoy a personalized recommendation system that improves their experience and engagement on the platform. Whether you're a dessert enthusiast or a curious recipe seeker, this recommender system will help you find your next favorite dessert recipe in no time!
The recent pandemic has led to a renewed interest in home cooking. With the closure of food establishments, people were left with little choice but to prepare their meals. According to a survey conducted by Hunter [1], a food and beverage marketing agency, 54% of Americans said they were cooking more at home since the pandemic started, 67% had increased confidence in their cooking abilities, and 48% said they were trying new recipes. A similar survey conducted by the International Food Information Council [2] reflected similar results, with 60% of respondents saying they are cooking at home more often.
Food recipe websites played a crucial role in this trend. They provided quick access to a wide selection of recipes, allowing users to search for recipes based on the available ingredients and cooking materials they had. In addition, with concerns about health and wellness during the pandemic, many people turned to these websites for ideas on how to prepare healthy meals. Food websites also offered resources on healthy eating, including tips on portion control and developing meal plans. The more popular websites also include forums and comment sections where users can discuss with one another. These gave home cooks a sense of community during a time of extended isolation.
Among the most popular food websites is food.com [3]. Founded in 2004, food.com is a collaborative platform where home cooks can share their favorite recipes and culinary creations. It has one of the largest and still growing collections of recipes online, rivaled only by other big food websites such as allrecipes.com and foodnetwork.com. It has one of the most active communities of users, who rate and review recipes they have tried and share cooking tips and tricks with beginners. More recently, food.com introduced meal planning and shopping list tools: users choose recipes from the collection, and the tools return a schedule of dishes for each day and a list of ingredients to buy.
One of the shortcomings of food.com is its lack of a personalized recommendation system. Such a system helps users on streaming websites like YouTube and Netflix to decide on what to watch next, with the latter going as far as offering a $1 million prize for developers who can beat its algorithm back in 2006 [4]. Recommender systems are commonly used by e-commerce platforms such as Amazon and Shopee to suggest products to their users based on their browsing and purchase history. This helps to increase customer satisfaction, reduce search costs, and improve sales.
Currently, food.com has a fixed set of recommendations based on the current recipe that a user is viewing. With the thousands of recipes available on the platform, users will benefit from a system that recommends recipes based on the previous recipes that a user has rated well. For beginners, recommending similar recipes will help them achieve mastery of a certain type of dish instead of doing a mediocre job on a wide variety of dishes. This is especially true for desserts since they are known to have a higher learning curve compared to other types of dishes. Recommending the correct recipes makes them more likely to continue cooking and in turn, continue visiting the website. Recommended recipes with similar ingredients also help in managing stock and reducing spoilage. More importantly, such systems help in increasing website traffic and user engagement by minimizing the user's search fatigue—showing users what they want to see with minimal effort on their part. The less time users spend on searching, the more time they can allot to writing reviews, posting comments in the forums, and trying out the website's lesser-known features.
The choice of focus is intentional. Recommender systems work best when there is high variability in the recommendable items in the system. This makes it more challenging to create such systems in niche-specific applications. The team wanted to modify the usual algorithms in recommender systems to include clustering techniques. The effect of the modification will not be apparent if the data already performs well for vanilla implementations.
Structurally, there is little difference between a recommender system for food recipes and one for food orders, so this work extends naturally to dessert e-commerce shops. The team identified two local shops that would benefit greatly from a recommender system: Kukido [5] and Lacher Patisserie [6]. Both stores have an e-commerce platform, but neither provides customized product recommendations; instead, they only display their monthly best sellers. This is a missed opportunity for returning customers who have a hard time deciding what else to buy. Personalized recommendations would improve user experience and increase the companies' sales.
The data comes from a Kaggle dataset of recipes and user interactions scraped from food.com. The dataset is also available in the jojie public dataset repository.
Filepath: /mnt/data/public/food-com-recipes
Dataset Summary:
Original Data Source: This dataset consists of 180K+ recipes and 700K+ recipe reviews covering 18 years of user interactions and uploads on Food.com (formerly GeniusKitchen). It was used in the following paper:
Generating Personalized Recipes from Historical User Preferences
Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, Julian McAuley
EMNLP, 2019
https://www.aclweb.org/anthology/D19-1613/
The dataset can be downloaded from Kaggle: https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions
Content: This dataset contains three sets of data from Food.com:
- Interaction splits
- Preprocessed data for result reproduction: in this format, the recipe text metadata is tokenized via the GPT subword tokenizer, with start-of-step and similar special tokens
- Raw data
| Field | Dtype | Type | Description |
|---|---|---|---|
| id | integer | nominal | Recipe ID |
| i | integer | nominal | Recipe ID mapped to contiguous integers from 0 |
| name_tokens | object | list(string) | BPE-tokenized recipe name |
| ingredient_tokens | object | list(string) | BPE-tokenized ingredients list (list of lists) |
| steps_tokens | integer | list(string) | BPE-tokenized steps |
| techniques | integer | list(string) | List of techniques used in recipe |
| calorie_level | integer | categorical | Calorie level in ascending order |
| ingredient_ids | object | list(string) | IDs of ingredients in recipe |
| Statistic | id | i | calorie_level |
|---|---|---|---|
| count | 178265 | 178265 | 178265 |
| mean | 213462 | 89132 | 0.863192 |
| std | 138267 | 51460.8 | 0.791486 |
| min | 38 | 0 | 0 |
| 25% | 94576 | 44566 | 0 |
| 50% | 196312 | 89132 | 1 |
| 75% | 320562 | 133698 | 2 |
| max | 537716 | 178264 | 2 |
| Field | Dtype | Type | Description |
|---|---|---|---|
| u | integer | ordinal | User ID mapped to contiguous integer sequence from 0 |
| techniques | object | list(string) | Cooking techniques encountered by user |
| items | object | list(string) | Recipes interacted with, in order |
| n_items | integer | nominal | Number of recipes reviewed |
| ratings | object | list(string) | Ratings given to each recipe encountered by this user |
| n_ratings | integer | nominal | Number of ratings in total |
| Statistic | u | n_items | n_ratings |
|---|---|---|---|
| count | 25076 | 25076 | 25076 |
| mean | 12537.5 | 27.8713 | 27.8713 |
| std | 7238.96 | 122.729 | 122.729 |
| min | 0 | 2 | 2 |
| 25% | 6268.75 | 3 | 3 |
| 50% | 12537.5 | 6 | 6 |
| 75% | 18806.2 | 16 | 16 |
| max | 25075 | 6437 | 6437 |
| Field | Dtype | Type | Description |
|---|---|---|---|
| user_id | integer | nominal | User ID |
| recipe_id | integer | nominal | Recipe ID |
| date | string | string | Date of interaction |
| rating | integer | nominal | Rating given |
| review | string | string | Review text |
| Statistic | user_id | recipe_id | rating |
|---|---|---|---|
| count | 1.13237e+06 | 1.13237e+06 | 1.13237e+06 |
| mean | 1.38429e+08 | 160897 | 4.41102 |
| std | 5.01427e+08 | 130399 | 1.26475 |
| min | 1533 | 38 | 0 |
| 25% | 135470 | 54257 | 4 |
| 50% | 330937 | 120547 | 5 |
| 75% | 804550 | 243852 | 5 |
| max | 2.00237e+09 | 537716 | 5 |
| Field | Dtype | Type | Description |
|---|---|---|---|
| name | string | string | Recipe name |
| id | integer | nominal | Recipe ID |
| minutes | integer | nominal | Minutes to prepare recipe |
| contributor_id | integer | nominal | User ID who submitted this recipe |
| submitted | string | string | Date recipe was submitted |
| tags | object | list(string) | Food.com tags for recipe |
| nutrition | object | list(string) | Nutrition information |
| n_steps | integer | nominal | Number of steps in recipe |
| steps | object | list(string) | Text for recipe steps, in order |
| description | string | string | User-provided description |
| Statistic | id | minutes | contributor_id | n_steps | n_ingredients |
|---|---|---|---|---|---|
| count | 231637 | 231637 | 231637 | 231637 | 231637 |
| mean | 222015 | 9398.55 | 5.53489e+06 | 9.7655 | 9.05115 |
| std | 141207 | 4.46196e+06 | 9.97914e+07 | 5.99513 | 3.7348 |
| min | 38 | 0 | 27 | 0 | 1 |
| 25% | 99944 | 20 | 56905 | 6 | 6 |
| 50% | 207249 | 40 | 173614 | 9 | 9 |
| 75% | 333816 | 65 | 398275 | 12 | 11 |
| max | 537716 | 2.14748e+09 | 2.00229e+09 | 145 | 43 |
| Field | Dtype | Type | Description |
|---|---|---|---|
| user_id | string | nominal | User ID |
| recipe_id | integer | nominal | Recipe ID |
| date | string | string | Date of interaction |
| rating | float | nominal | Rating given |
| u | integer | nominal | User ID, mapped to contiguous integers from 0 |
| i | integer | nominal | Recipe ID, mapped to contiguous integers from 0 |
| Statistic | user_id | recipe_id | rating | u | i |
|---|---|---|---|---|---|
| count | 12455 | 12455 | 12455 | 12455 | 12455 |
| mean | 2.91269e+07 | 209323 | 4.21309 | 12288.5 | 115488 |
| std | 2.33436e+08 | 135002 | 1.3385 | 6897.75 | 50448.7 |
| min | 1533 | 120 | 0 | 2 | 102 |
| 25% | 169842 | 94616 | 4 | 6428.5 | 76904 |
| 50% | 382954 | 195040 | 5 | 12023 | 127793 |
| 75% | 801637 | 314928 | 5 | 17985.5 | 160024 |
| max | 2.00225e+09 | 537716 | 5 | 25074 | 178264 |
| Field | Dtype | Type | Description |
|---|---|---|---|
| user_id | string | nominal | User ID |
| recipe_id | integer | nominal | Recipe ID |
| date | string | string | Date of interaction |
| rating | float | nominal | Rating given |
| u | integer | nominal | User ID, mapped to contiguous integers from 0 |
| i | integer | nominal | Recipe ID, mapped to contiguous integers from 0 |
| Statistic | user_id | recipe_id | rating | u | i |
|---|---|---|---|---|---|
| count | 698901 | 698901 | 698901 | 698901 | 698901 |
| mean | 1.24769e+07 | 156173 | 4.57409 | 4249.33 | 87519.3 |
| std | 1.52503e+08 | 126595 | 0.959022 | 5522.6 | 51290.4 |
| min | 1533 | 38 | 0 | 0 | 0 |
| 25% | 105988 | 53169 | 4 | 455 | 42988 |
| 50% | 230102 | 116484 | 5 | 1737 | 87424 |
| 75% | 480195 | 234516 | 5 | 5919 | 131731 |
| max | 2.00231e+09 | 537458 | 5 | 25075 | 178262 |
| Field | Dtype | Type | Description |
|---|---|---|---|
| user_id | string | nominal | User ID |
| recipe_id | integer | nominal | Recipe ID |
| date | string | string | Date of interaction |
| rating | float | nominal | Rating given |
| u | integer | nominal | User ID, mapped to contiguous integers from 0 |
| i | integer | nominal | Recipe ID, mapped to contiguous integers from 0 |
| Statistic | user_id | recipe_id | rating | u | i |
|---|---|---|---|---|---|
| count | 7023 | 7023 | 7023 | 7023 | 7023 |
| mean | 1.94779e+07 | 206406 | 4.23281 | 10298 | 100122 |
| std | 1.90469e+08 | 135238 | 1.30291 | 6709.5 | 52051.1 |
| min | 1533 | 120 | 0 | 5 | 144 |
| 25% | 159119 | 89851.5 | 4 | 4569.5 | 56227 |
| 50% | 352834 | 192146 | 5 | 9248 | 104819 |
| 75% | 737332 | 311632 | 5 | 15637.5 | 146690 |
| max | 2.00223e+09 | 536464 | 5 | 25055 | 178263 |
Several pre-processing steps were undertaken to achieve the desired item profile matrix and utility matrix to be used in the recommender system. Some observations on the resulting dessert data:
- The most common ingredients in desserts are brown sugar and baking soda.
- The most common dessert names involve chocolate and apple, while cake and cookie are the most common dessert types.
- 10,012 ratings are 5.0, which is 76% of the total rows.
- 3,372 items have only one rating, which is 59% of the total dataset.
1. Raw Data Exploration
2. Data Cleaning and Preprocessing: applied to both the user data and the recipe data
3. Data Vectorization: the recipe data is vectorized using a TF-IDF vectorizer to convert the recipes into features
4. Dimensionality Reduction: applied to both the user data and the recipe data
5. Clustering: applied to both the user data and the recipe data
6. Recommender System
Although it is possible to cluster the data using various clustering techniques at the onset, it is highly recommended to perform dimensionality reduction first to reduce computational complexity. Decent results are sometimes possible without it, but not for high-dimensional data: the curse of dimensionality leads to poor clustering performance.
In this specific context, we perform Singular Value Decomposition (SVD) on both the user data and the recipe data. The choice of dimensionality reduction technique depends on the nature of the data; here, SVD is especially fitting because the data is sparse.
We then check how many singular vectors to retain in order to preserve at least 80% of the information, as measured by the cumulative explained variance. Refer to the plot below.
By plotting the first two Singular Vectors, we can visualize what the data looks like in two-dimensional space. Normally, the first two singular vectors should contain most of the information, but in this case, the first two singular vectors only represent approximately 2% of the information.
Next, we plot the most important features of the first five singular vectors. By doing so, we can visualize the contributions of each feature to the overall information contained in the dataset.
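The variance-retention check above can be sketched with scikit-learn's `TruncatedSVD`. The synthetic low-rank matrix below is a stand-in for the actual sparse user feature matrix:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
# Synthetic stand-in: a low-rank signal plus a little noise.
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 60)) \
    + 0.1 * rng.normal(size=(500, 60))

svd = TruncatedSVD(n_components=50, random_state=42).fit(X)
cumvar = np.cumsum(svd.explained_variance_ratio_)

# Smallest number of singular vectors preserving >= 80% of the variance.
n_keep = int(np.searchsorted(cumvar, 0.80) + 1)
print(n_keep, cumvar[n_keep - 1])
```

`TruncatedSVD` is chosen here because, unlike PCA, it works directly on sparse matrices without centering them.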
Again, we perform Singular Value Decomposition, this time on the recipe dataset. We again retain at least 80% of the information, which requires about 31 singular vectors.
For purposes of visualization, we plot the first two singular vectors; these account for only about 17% of the information, so this plot does not capture the true nature of the data.
Quick checkpoint: let us summarize why dimensionality reduction is necessary:
Dimensionality reduction helps address these issues by reducing the number of dimensions in the data while preserving the most important information. In general, this improves the performance of clustering algorithms through the following:
Now that we have performed dimensionality reduction, we can now perform clustering knowing that we are going to get higher quality clusters. We must take note that in selecting the optimal number of clusters, we must rely on the internal validation scores instead of the plot visualization. The plot only represents a relatively small amount of information, hence judging the clusters visually can lead to poor conclusions.
For clustering, we decided to perform the following techniques. (Note: only three clustering techniques are shown in the main body; the others have been moved to the appendices.)
1. Representative-based Clustering
2. Hierarchical Clustering
3. Density-based Clustering
4. Probabilistic Clustering
First, we perform various clustering algorithms on the user dataset. Please refer to exhibit 1 for other clustering techniques performed on the user dataset.
For our clustering technique, we ultimately decided to use K-means clustering as it provides the following advantages:
We performed hyperparameter tuning using a simple grid-search algorithm where we perform clustering using a range of values to search for the best number of clusters.
Here is a summary of the scores:
Using the DB index, we identified that our optimal number of clusters is k=2.
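A minimal sketch of this grid search, run on synthetic blobs standing in for the SVD-reduced user features (the centers below are made up to keep the example deterministic):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Two well-separated synthetic blobs stand in for the reduced user data.
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        "silhouette": silhouette_score(X, labels),          # higher is better
    }

best_k = min(scores, key=lambda k: scores[k]["davies_bouldin"])
print(best_k)
```

Note the direction of each score: the Davies-Bouldin index is minimized, while the silhouette coefficient is maximized.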
As an extra step, we analyze the result of our clustering at k=2. Here are our observations:
Cluster 0
Cluster 1
For the other observations, refer to the information found in the tables and plots below.
Looking at the plots, the labels can be as follows:
Cluster 0
Cluster 1
Based on this information, we can then label our clusters.
We also perform clustering on the recipe data, following the same clustering method we applied to the user data. Note again that relying on visualizations alone is unreliable; we choose the optimal number of clusters using the internal validation criteria instead.
Please refer to exhibit 2 for the other clustering techniques performed on the recipe dataset.
We also performed hyperparameter tuning using a simple grid-search algorithm where we perform clustering using a range of values to search for the best number of clusters.
Here is a summary of the scores:
Using the DB index and Silhouette coefficient, we identified that our optimal number of clusters is k=2.
Now that we have identified the optimal number of clusters at k=2, let us further explore the characteristics of our clusters by performing a simple analysis. For each cluster, let us take the mean of each feature. This will give us a grasp of the characteristics of the data points contained within each cluster. Here are our observations:
Cluster 0
Cluster 1
From this, we can infer that cluster 1 is comprised of high-nutrient desserts whereas cluster 0 is comprised of low-fat desserts. Of course, this method is not entirely reliable, as we are only looking at the raw values taken from the mean of each cluster. Hence in the next step, we will perform hypothesis testing to identify the significant features of each cluster.
If you've taken an advanced statistics class, chances are your professor has said something along these lines:
"In statistics, we do NOT eyeball"
Hypothesis testing is an essential statistical tool used to make decisions about a population based on sample data. It allows us to test whether an observed effect is statistically significant or whether it could have occurred by chance.
Our chosen statistical test is the SFIT method. The Single Feature Introduction Test (SFIT) is a simple and computationally efficient significance test for the features of a machine learning model. It identifies the statistically significant features as well as feature interactions of any order in a hierarchical manner. [14][15]
Before we can perform classification, we first labeled each recipe using its cluster assignment from our chosen clustering algorithm (K-means, at k=2).
To perform classification on the clusters, we trained a neural network with three ReLU hidden layers of sizes 100, 50, and 25. The network is trained for at most 50 epochs using the Adam optimizer.
After running the classification, we perform the Single Feature Introduction Test (SFIT) on the trained network using only the data for a specific cluster, returning that cluster's most important features.
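A simplified sketch of the classify-then-test idea (not the full hierarchical SFIT procedure of [14][15]): train an MLP with the hidden sizes described above on synthetic labeled data, then introduce one feature at a time from a neutral baseline and score it by how far the predicted probability moves.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Synthetic "cluster label" driven by features 0 and 2.
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)

# Network shaped like the one described above: three ReLU hidden layers.
clf = MLPClassifier(hidden_layer_sizes=(100, 50, 25), activation="relu",
                    solver="adam", max_iter=500, random_state=0).fit(X, y)

# Single-feature introduction, simplified: start from a neutral baseline
# (every feature at its median) and introduce one real feature at a time,
# scoring it by how much the predicted class probability moves.
baseline = np.tile(np.median(X, axis=0), (len(X), 1))
base_prob = clf.predict_proba(baseline)[:, 1]
importance = {}
for j in range(X.shape[1]):
    probe = baseline.copy()
    probe[:, j] = X[:, j]  # introduce feature j only
    importance[j] = float(np.mean(np.abs(clf.predict_proba(probe)[:, 1]
                                         - base_prob)))

ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked)
```

On this synthetic data, the features that actually drive the label should rise to the top of the ranking, while pure-noise features score near zero.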
Here are the five most important features of each cluster:
Cluster 0
Cluster 1
These features are what separate one cluster from the other. However, the results do not tell us exactly why these features are considered the most important. It does not tell us the feature's abundance or lack thereof.
Since we are comparing only two clusters, it is likely to get common significant features for both clusters. In our case, we got calories as a common significant feature. This could mean that for the calorie feature, both clusters are likely to demonstrate a significant difference in their mean or median.
To augment our results from the SFIT method, we can check the values of each of these features for both clusters. Recall that earlier we looked at the mean/median of calories and saturated fat, and indeed, the clusters differ significantly in terms of calories and saturated fat.
Recap:
Cluster 0
Cluster 1
Now that we have this information, we can then label our clusters.
This section is the most important part of this study. This section directly answers the problem statement (why this study exists). Let us recap:
What?
Why?
Now before we proceed, let us first ask some questions to better understand what recommender systems are and what value they provide to a business.
What is a recommender system?
A recommendation system is a subclass of information filtering systems that seeks to predict the rating or preference a user would give to an item. In simple terms, it is an algorithm that suggests relevant items to users: which movie to watch on Netflix, which product to buy on an e-commerce site, which book to read on a Kindle, and so on. [16]
Why are recommender systems important?
Recommender systems are important because they help users discover and engage with content or products that are most relevant and interesting to them. In an era of information overload, recommender systems play a critical role in filtering and personalizing content for individual users.
Here are some specific reasons why recommender systems are important:
To summarize, recommender systems are powerful business tools that provide personalized recommendations to users. They help users discover new content and products that they may not have found otherwise, while also driving sales and revenue for businesses.
There are many types of recommender systems and we can go all day talking about them. However, for this study, we need to help Chef Almond choose the best type of recommender system out of the following:
For this study, we will be using the scikit-surprise library, a popular Python library for building recommender systems with collaborative filtering techniques. The algorithms we decided to use are user-based k-NN, item-based k-NN, and SVD. Here's a comparison of the three as implemented in scikit-surprise:
After implementing the three aforementioned algorithms, we will then evaluate each model and identify which of the three performs the best.
This algorithm is an implementation of user-based collaborative filtering using k-NN. It computes similarities between users and uses the k-nearest neighbors to make recommendations. User-based k-NN can be fast and effective for small to medium-sized datasets but can suffer from scalability and sparsity issues for larger datasets.
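A from-scratch sketch of the user-based idea on a toy utility matrix (the library's `KNNBasic` implements a refined version of this, with configurable similarity measures). The matrix values below are made up:

```python
import numpy as np

# Toy utility matrix: rows = users, columns = recipes, 0 = unrated.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def predict_user_based(R, user, item, k=2):
    """Predict a rating as the similarity-weighted mean of the k most
    similar users who actually rated the item (cosine similarity)."""
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[user] / (norms * norms[user] + 1e-9)
    sims[user] = -1  # exclude the user themself
    raters = np.where(R[:, item] > 0)[0]
    raters = raters[np.argsort(sims[raters])[::-1][:k]]
    if len(raters) == 0:
        return np.nan
    w = sims[raters]
    return float(w @ R[raters, item] / w.sum())

# User 0 has not rated item 2; the most similar user rated it poorly,
# so the predicted rating comes out low.
print(predict_user_based(R, user=0, item=2))
```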
This algorithm is an implementation of item-based collaborative filtering using k-NN. It computes similarities between items and uses the k-nearest neighbors to make recommendations. Item-based k-NN can be faster and more scalable than user-based k-NN and can handle sparsity better. However, it may not work as well for datasets with a large number of items.
The SVD algorithm is a latent factor-based collaborative filtering algorithm that uses matrix factorization to learn latent factors that capture user preferences and item characteristics. It can handle large datasets and sparse data well and can be more accurate than k-NN methods in many cases. However, it can be slower to train and can be more difficult to interpret.
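A stripped-down sketch of what such a latent-factor model learns: stochastic gradient descent on a regularized squared error, factorizing the ratings into user and item latent vectors. For brevity this omits the per-user and per-item bias terms that the full algorithm also fits; the tiny rating triples are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# (user, item, rating) triples with a block-like taste pattern.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5),
           (2, 2, 5), (2, 3, 4), (3, 2, 4), (3, 3, 5),
           (0, 2, 1), (2, 0, 1)]

n_users, n_items, n_factors = 4, 4, 2
P = 0.1 * rng.normal(size=(n_users, n_factors))  # user latent factors
Q = 0.1 * rng.normal(size=(n_items, n_factors))  # item latent factors
mu = np.mean([r for _, _, r in ratings])         # global mean rating
lr, reg = 0.05, 0.02

# SGD on the regularized squared error of each observed rating.
for _ in range(500):
    for u, i, r in ratings:
        err = r - (mu + P[u] @ Q[i])
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])

rmse = np.sqrt(np.mean([(r - (mu + P[u] @ Q[i])) ** 2
                        for u, i, r in ratings]))
print(round(rmse, 3))
```

After training, the dot product of a user vector and an item vector (plus the global mean) reconstructs the observed ratings, and the same dot product scores unrated items for recommendation.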
Now that we have finished building the three recommender systems, how do we compare them to each other?
There are debates about how to evaluate a recommender system or what KPIs should be employed to make a good comparison. Recommender systems can be evaluated in many ways using several metrics where each metric group has its purpose. However, for this study, we will be comparing our recommender systems using the following metrics:
Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are used to evaluate the accuracy of predicted values, such as ratings, against the true value y: MSE = (1/n) Σ (y_i - ŷ_i)², and RMSE = √MSE. These can also be used to evaluate the reconstruction of a rating matrix. [13]
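For concreteness, computing both metrics on a few hand-made ratings:

```python
import numpy as np

y_true = np.array([5, 4, 3, 5])       # actual ratings
y_pred = np.array([4.5, 4.0, 3.5, 4.0])  # predicted ratings

mse = float(np.mean((y_true - y_pred) ** 2))
rmse = float(np.sqrt(mse))
print(mse, rmse)  # MSE = 0.375, RMSE ≈ 0.612
```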
Coverage is the percentage of items in the training data that the model is able to recommend on a test set: [9][11]

Coverage = |I| / |N| × 100%

where I is the set of unique items the model recommends in the test data, and N is the set of unique items in the training data. Catalog coverage is the rate of distinct items recommended over a period of time; for this purpose, the catalog coverage function also takes a parameter k, the number of observed recommendation lists. In essence, both metrics quantify the proportion of items that the system can work with. [13]
To better understand this, say we have 100 items in our training data and we always recommend the same set of 10 items to every user. The coverage is then only 10%, since the model only ever recommends the same 10 items. In math terms, coverage is the size of the union of all user recommendation lists divided by the total number of items.
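The worked example above, in code (the user lists are made up):

```python
# Each user's top-N recommendation list (item IDs), from a 100-item catalog.
recs = {
    "u1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "u2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "u3": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
n_catalog = 100

# Union of everything the model ever recommends, over the whole catalog.
recommended = set().union(*recs.values())
coverage = 100 * len(recommended) / n_catalog
print(coverage)  # 10.0: the same 10 items for everyone
```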
Here, the user-based collaborative filtering has an 18.97% coverage, while the item-based collaborative filtering is only able to recommend 1.58% of the items it was trained on. The recommender system with the highest coverage is the latent factor-based collaborative filtering with 41.42% coverage.
Novelty measures the capacity of a recommender system to propose novel and unexpected items which a user is unlikely to know about already. It uses the self-information of each recommended item: it calculates the mean self-information per top-N recommendation list and averages this over all users: [13]

Novelty = (1/|U|) Σ_u (1/N) Σ_{i ∈ L_u} -log2(count(i) / |U|)

where |U| is the number of users, count(i) is the number of users who consumed item i, and N is the length of the recommendation list. [13]
The novelty metric in recommender systems measures the degree to which recommended items are new and unexpected to the user. It is a way of quantifying how much the recommendations can introduce the user to new items that they have not previously encountered or considered. [8][10]
The novelty metric is important in recommender systems because it can help to prevent the problem of "filter bubbles," where users are only recommended items that align with their existing preferences and interests. By introducing users to new and unexpected items, recommender systems can help to broaden their horizons and expose them to new ideas and experiences. [8][10]
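A small hand-computable sketch of the novelty formula, with made-up recommendation lists and popularity counts:

```python
import math

# Top-N lists for three users; counts[i] = how many users consumed item i
# in the training data. Popular items carry less self-information.
recs = {"u1": ["a", "b"], "u2": ["a", "c"], "u3": ["b", "c"]}
counts = {"a": 3, "b": 2, "c": 1}
n_users, N = 3, 2

# Mean self-information per list, averaged over users.
novelty = sum(
    sum(-math.log2(counts[i] / n_users) for i in items) / N
    for items in recs.values()
) / len(recs)
print(round(novelty, 3))  # ≈ 0.723
```

Item "a", consumed by everyone, contributes zero self-information, while the rare item "c" contributes the most, so systems that recommend long-tail items score higher.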
Personalization is a great way to assess whether a model recommends many of the same items to different users. It is the dissimilarity (1 - cosine similarity) between users' lists of recommendations. An example will best illustrate how personalization is calculated. [9]
A high personalization score indicates a user’s recommendations are different, meaning the model is offering a personalized experience to each user. [9]
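A minimal sketch: represent each user's top-N list as a binary vector over the catalog and take one minus the mean pairwise cosine similarity (catalog and lists below are made up):

```python
from itertools import combinations

import numpy as np

# Binary user × item matrix: 1 if the item appears in that user's top-N list.
catalog = ["a", "b", "c", "d"]
recs = {"u1": ["a", "b"], "u2": ["a", "c"], "u3": ["c", "d"]}
M = np.array([[1 if item in recs[u] else 0 for item in catalog]
              for u in recs])

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Personalization = 1 - mean cosine similarity over all user pairs.
pairs = [cosine(M[i], M[j]) for i, j in combinations(range(len(M)), 2)]
personalization = 1 - float(np.mean(pairs))
print(round(personalization, 3))  # 0.667
```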
Summary:
The intra-list similarity is the average cosine similarity of all items in a list of recommendations. This calculation uses features of the recommended items (such as movie genre) to calculate the similarity. This calculation is also best illustrated with an example. [9]
If a recommender system is recommending lists of very similar items to single users (for example, a user receives only recommendations of romance movies), then the intra-list similarity will be high. [9]
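A sketch using made-up one-hot dessert-type features, mirroring the romance-movie example above:

```python
from itertools import combinations

import numpy as np

# One-hot "dessert type" feature vectors (cake / cookie / pie).
features = {
    "choc_cake":   np.array([1, 0, 0]),
    "choc_cookie": np.array([0, 1, 0]),
    "apple_pie":   np.array([0, 0, 1]),
    "fudge_cake":  np.array([1, 0, 0]),
}

def intra_list_similarity(items):
    """Average pairwise cosine similarity of the items in one list."""
    sims = [
        features[a] @ features[b]
        / (np.linalg.norm(features[a]) * np.linalg.norm(features[b]))
        for a, b in combinations(items, 2)
    ]
    return float(np.mean(sims))

print(intra_list_similarity(["choc_cake", "fudge_cake"]))  # 1.0: homogeneous
print(intra_list_similarity(["choc_cake", "apple_pie"]))   # 0.0: diverse
```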
Now that we have all the scores for each recommender system, let us compare them using a radar plot. Here, we compare the personalization, coverage, and intra-list similarity. The reason why novelty and error scores are not part of this radar plot is that they operate under a different scale and must be visualized differently. Regardless, let us proceed with the comparison using the radar plot.
How do we know that one model is outperforming the others? In a radar plot, the model that has the largest area is considered to be the best performing.
In this case, the latent-factor-based model outperforms the other two by quite a huge margin, especially in personalization and coverage.
Let us put all the values in a single table so we can compare the numbers. Let us summarize these metrics and put them into context:
From these values, it is apparent that the latent-factor-based model heavily outperforms the other models. Moving forward, we will be using the latent-factor-based model (SVD) to conduct cluster-specific recommendations.
Now we do the same thing in the next sections. This time, however, we incorporate clustering with recommender systems. First, we will tackle user-cluster-specific recommendations. What does this mean?
Users are first grouped into clusters based on their preferences or behaviors. Then, recommendations are generated for each user based on the preferences of other users in the same cluster. This approach can help overcome the cold-start problem, where new users or items have limited data available for recommendation.
These are the recommendations for User Cluster 0.
These are the recommendations for User Cluster 1.
These are the evaluation results for each set of recommendations made, for user cluster 0 and user cluster 1.
Earlier, we did user-cluster-specific recommendations. This time, we will do item-cluster-specific recommendations. What does this mean?
Clustering item profiles groups similar items together based on their characteristics and attributes. This helps to improve the accuracy and efficiency of the recommendation process by reducing the amount of computation needed to find relevant items for a user.
By grouping similar items, the system can make recommendations that are more tailored to a user's preferences, since items within a cluster are likely to have similar appeal. Moreover, clustering item profiles can help to address the problem of cold start.
These are the recommendations for item cluster 0.
These are the recommendations for item cluster 1.
These are the evaluation results for each set of recommendations made, for item cluster 0 and item cluster 1.
We also make recommendations based on user preference. This is one variation of a combined implementation of clustering and collaborative filtering. Why a variation? Because there are many ways to combine clustering with collaborative filtering.
Why do we want this?
Combining clustering of both users and items with recommender systems can lead to better recommendations for users in several ways:
Improved accuracy: Clustering users and items can help identify patterns in user behavior and preferences, as well as in the characteristics of the items being recommended. By leveraging this information, recommender systems can make more accurate recommendations that are tailored to the specific needs and preferences of individual users.
Increased coverage: Clustering can help identify items that may be of interest to users who have not yet interacted with them. By grouping similar items, recommender systems can recommend items that users may not have otherwise discovered, expanding the overall coverage of the system.
Increased diversity: Recommender systems can suffer from the problem of over-specialization, where users are recommended the same types of items over and over again. By clustering items and users, recommender systems can identify diverse sets of recommendations that still meet users' needs and preferences.
Improved scalability: Clustering can help with the scalability of recommender systems by reducing the dimensionality of the data being processed. This can make it easier and faster to generate recommendations, especially when dealing with large datasets.
Another variation of combining clustering and collaborative filtering changes the utility matrix so that its rows are the cluster labels of users and its columns are the cluster labels of items. This can be beneficial in several ways:
Reduced dimensionality: By clustering users and items, the number of rows and columns in the utility matrix is reduced, which can help with the scalability of the recommender system. This can be particularly useful when dealing with large datasets where the full utility matrix may be too large to store or process efficiently.
Improved accuracy: Clustering users and items can help to identify patterns in user behavior and preferences, as well as in the characteristics of the items being recommended. By grouping similar users and items, the recommender system can make more accurate recommendations that are tailored to the specific needs and preferences of each user cluster and item cluster.
Increased diversity: Clustering users and items can also help to increase the diversity of the recommendations being made. By identifying similar users and items within clusters, the recommender system can recommend a more diverse set of items to each user cluster.
Better handling of sparsity: When dealing with sparse data, clustering can help to identify latent relationships between users and items that may not be immediately obvious in the raw data. This can lead to better recommendations, even when there are only a few interactions between users and items.
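The cluster-label utility matrix described above can be sketched with a simple pivot (toy ratings; the cluster labels are assumed to come from an earlier clustering step):

```python
import pandas as pd

# Ratings already tagged with hypothetical user/item cluster labels
df = pd.DataFrame({
    "user_cluster": [0, 0, 0, 1, 1, 1],
    "item_cluster": [0, 1, 1, 0, 0, 1],
    "rating":       [5, 3, 4, 2, 4, 5],
})
# Cluster-level utility matrix: mean rating per (user cluster, item cluster) pair
utility = df.pivot_table(index="user_cluster", columns="item_cluster",
                         values="rating", aggfunc="mean")
print(utility)
```

Aggregating by cluster collapses a potentially huge user-by-item matrix down to a clusters-by-clusters one, which is where the dimensionality and sparsity benefits come from.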
Why is dimensionality reduction necessary?
Dimensionality reduction is necessary for several reasons:
Explain the rationale behind the choice of dimensionality reduction technique.
There are several dimensionality reduction techniques available which can be applied to data, below are some examples:
However, the choice ultimately came down to SVD. SVD generally handles sparse data better than PCA: it can work with matrices containing missing or zero values, and it reduces the dimensionality of the data without losing important information.
This makes SVD the more flexible and robust method, particularly for sparse data.
Since we are working with a utility matrix of food ratings, which is sparse, SVD is a natural fit.
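For instance, scikit-learn's `TruncatedSVD` accepts a sparse matrix directly, whereas plain PCA would first require densifying and mean-centering it (toy dimensions shown, not the study's actual matrix):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A toy sparse "utility matrix": 100 users x 50 items, ~5% of cells filled
utility = sparse_random(100, 50, density=0.05, random_state=42)

# TruncatedSVD consumes sparse input directly; plain PCA would need a dense,
# mean-centered matrix first
svd = TruncatedSVD(n_components=10, random_state=42)
reduced = svd.fit_transform(utility)
print(reduced.shape)  # → (100, 10)
```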
How many singular vectors were retained?
Below is the summary of the number of singular vectors retained for each dataset.
User Rating Matrix:
Item Profile Matrix:
What features were used for the item profile matrix?
For the item profile matrix, we used the following features:
The ingredients list was vectorized to represent the sequence of characters or words in a numerical format that the models can understand.
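As an illustration of the vectorization step, here is a TF-IDF encoding of hypothetical ingredient lists (the actual vectorizer and ingredient strings in the study may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical ingredient lists, one space-separated string per recipe
ingredients = [
    "flour sugar butter eggs vanilla",
    "flour sugar cocoa butter eggs",
    "cream sugar gelatin vanilla",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(ingredients)  # sparse (n_recipes, vocabulary_size)
print(X.shape)  # → (3, 8)
```

Each recipe becomes a weighted numeric vector over the ingredient vocabulary, which can then be concatenated with the other item-profile features.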
What clustering algorithms/techniques were used?
For clustering, we decided to perform the following techniques. (Note: only three of the clustering techniques are shown in the main body; the others have been moved to the appendices.)
1. Representative-based Clustering
2. Hierarchical Clustering
3. Density-based Clustering
4. Probabilistic Clustering
Explain the rationale behind the choice of the clustering algorithm.
For our clustering technique, we ultimately decided to use K-means clustering as it provides the following advantages:
How did we choose the optimal number of clusters, k?
We performed hyperparameter tuning with a simple grid search: we ran the clustering over a range of k values and compared internal validation scores to find the best number of clusters.
We must take note that in selecting the optimal number of clusters, we must rely on the internal validation scores instead of the plot visualization. The plot only represents a relatively small amount of information, hence judging the clusters visually can lead to poor conclusions.
Here is a summary of the scores:
Using the DB index, we identified that our optimal number of clusters is k=2.
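A grid search of this kind can be sketched as follows, using synthetic blobs in place of the real reduced data and the Davies-Bouldin index (lower is better); on two well-separated blobs it recovers k=2:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Two well-separated synthetic blobs stand in for the real reduced data
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)  # lower is better

best_k = min(scores, key=scores.get)
print(best_k)  # → 2
```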
How did we label the clusters?
The clusters were labeled using a novel two-step method. A classifier is first trained to predict the cluster labels, then the Single Feature Introduction Test (SFIT) method is run on the model to identify the statistically significant features that characterize each cluster.
To classify the clusters, we trained a 3-hidden-layer neural network with ReLU activation functions and hidden sizes of 100, 50, and 25. The network is trained for at most 50 epochs using the Adam optimizer, with pruning to stop unpromising runs early.
After running the classification, we perform the Single Feature Introduction Test (SFIT) on the trained network by only using the data for a specific cluster, and returning a cluster's most important features.
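A simplified, illustrative version of this probe (not the exact SFIT procedure from the paper): train a small network on toy data where only the first feature matters, then introduce one feature at a time over a neutral baseline and measure how far the predictions move.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy data: feature 0 fully determines the cluster label, feature 1 is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] > 0).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(100, 50, 25), activation="relu",
                    solver="adam", max_iter=500, random_state=0).fit(X, y)

# Introduce one feature at a time over a neutral (all-zero) baseline and
# measure how far the predicted probabilities move from the baseline output
baseline = np.zeros_like(X)
p0 = clf.predict_proba(baseline)[:, 1]
importance = {}
for j in range(X.shape[1]):
    probe = baseline.copy()
    probe[:, j] = X[:, j]                       # feature j "introduced" alone
    p = clf.predict_proba(probe)[:, 1]
    importance[j] = np.abs(p - p0).mean()

print(max(importance, key=importance.get))  # → 0 (the informative feature)
```

Features whose introduction barely moves the output are the ones the classifier, and hence the clustering, does not rely on.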
Here are the five most important features of each cluster:
Cluster 0
Cluster 1
These features are what separate one cluster from the other. However, the results do not tell us exactly why these features are considered the most important, nor whether each feature is abundant or lacking in a cluster.
Since we are comparing only two clusters, some significant features are likely to be shared by both. In our case, calories and saturated fat were significant for both clusters, which suggests that the clusters differ markedly in their mean or median for these features.
To augment the SFIT results, we can inspect the values of these features in each cluster. Earlier, we looked at the mean/median calories and saturated fat for both clusters, and indeed the clusters differ significantly on both.
Recap:
Cluster 0
Cluster 1
Now that we have this information, we can then label our clusters.
What is a recommender system?
A recommender system is an information filtering system that predicts and recommends items to users based on their preferences, interests, and past behavior. The goal of a recommender system is to provide personalized recommendations that are useful and relevant to the user.
Why is a recommender system important?
A recommender system is important for several reasons:
What type of recommender systems were tried out?
There are many types of recommender systems and we can go all day talking about them. However, for this study, we only chose to run and test three types namely:
Explain the rationale behind the choice of recommender system.
Each of the three collaborative filtering methods offers its own sets of advantages and disadvantages. It's hard to rank the three in terms of their qualitative aspects. However, by using evaluation metrics, we can easily choose the best recommender system that would perfectly fit the use case. The choice of recommender system was ultimately based on the evaluation metrics.
How did we evaluate the performance of the recommender system?
The performance of the recommender systems was evaluated based on the following metrics:
What libraries were used to create the recommender systems?
For this study, we used the Scikit-Surprise library, a popular Python library for building recommender systems with collaborative filtering techniques. The algorithms we decided to use were user-based kNN, item-based kNN, and SVD.
In conjunction, we also used the recmetrics library, a Python library of evaluation metrics and diagnostic tools for recommender systems. Recmetrics accepts the predictions from Scikit-Surprise and returns metrics such as coverage, novelty, and intra-list similarity.
What problems did we encounter with dimensionality reduction?
The following were the problems encountered during the implementation of dimensionality reduction:
Since the reduced data still contained a high number of dimensions (539 and 31), we could not rely on visualizations produced by projecting onto the first two singular vectors. Normally, the first two singular vectors capture most of the information; in this case, however, they were not enough for us to rely on the 2D plot.
The internal validation scores did not agree on a single top-ranked number of clusters k. Instead, we had to use our judgment, guided by the information the internal validation scores provide, to decide the optimal number of clusters.
What problems did we encounter with clustering?
The following were the problems encountered during the implementation of clustering:
What problems did we encounter with the implementation of the recommender system?
Among the collaborative filtering techniques used, the best-performing one was latent-factor-based collaborative filtering. The algorithm was able to capture the complex relationships between users and items and it was able to make more accurate predictions than the user-based and item-based approaches.
User-based collaborative filtering can potentially funnel recommendations. This phenomenon causes the recommendations to be almost identical for all users.
The SFIT method does not work well with datasets having numerous features.
The internal validation scores should hold more weight than visual appeal in two-dimensional space when deciding on the optimal number of clusters. This is especially true for data of high dimensionality.
Data transformation should be performed user-wise, and not item-wise.
The latent-factor-based recommender system was able to address the problem raised by Chef Almond:
The choice of the clustering method depends on several factors, such as the nature of the data, the desired clustering outcome, and the resources available.
Choosing the best number of clusters should not be based on the clustering that has the best visual appeal.
This study can be further improved through the application of other clustering techniques and algorithms which could potentially reveal more meaningful insights from our data. See below for some examples of other clustering techniques and algorithms:
Perform content-based collaborative filtering to capture the contributions of the ingredients as features of each recipe profile.
Try out other algorithms from the Scikit-Surprise library to explore algorithms that may better fit the dataset, such as:
random_pred.NormalPredictor - Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
baseline_only.BaselineOnly - Algorithm predicting the baseline estimate for a given user and item.
knns.KNNBasic - A basic collaborative filtering algorithm.
knns.KNNWithMeans - A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
knns.KNNWithZScore - A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
knns.KNNBaseline - A basic collaborative filtering algorithm taking into account a baseline rating.
matrix_factorization.SVD - The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
matrix_factorization.SVDpp - The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
matrix_factorization.NMF - A collaborative filtering algorithm based on Non-negative Matrix Factorization.
slope_one.SlopeOne - A simple yet accurate collaborative filtering algorithm.
co_clustering.CoClustering - A collaborative filtering algorithm based on co-clustering.
Explore the batch recommender system. This can be done by clustering both the users and items where their labels will serve as the rows and columns, respectively. The cluster mean will be used as the value for the utility matrix. This implementation will dramatically reduce the run time as the recommendations will be the same for each cluster.
Combine the frequent itemset mining with recommender systems. By doing so, we can use the association rules to generate recommendations for the users. This can potentially provide better user recommendations.
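A bare-bones sketch of the frequent-itemset idea (toy data; a real implementation would use an Apriori/FP-Growth library rather than enumerating pairs): treat each user's highly rated recipes as a transaction and surface recipe pairs that co-occur often.

```python
from collections import Counter
from itertools import combinations

# Each "transaction" is the set of recipes one user rated highly (toy data)
transactions = [
    {"brownies", "cheesecake", "cookies"},
    {"brownies", "cookies"},
    {"cheesecake", "flan"},
    {"brownies", "cookies", "flan"},
]
min_support = 0.5  # a pair must appear in at least half of the transactions

pair_counts = Counter()
for t in transactions:
    pair_counts.update(combinations(sorted(t), 2))

frequent_pairs = {pair: count / len(transactions)
                  for pair, count in pair_counts.items()
                  if count / len(transactions) >= min_support}
print(frequent_pairs)  # → {('brownies', 'cookies'): 0.75}
```

From such pairs, association rules like "users who liked brownies also liked cookies" can be turned directly into recommendations.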
Explore other food categories like main dish, French, or 30-minute dish. This can also be applied to other types of food or desserts.
Recommender systems are considered domain-agnostic and may be applied to other fields such as media, retail, clothing, and transportation.
Another way to make the study more interesting is to reveal clusters within clusters. Although more complex at the onset, it can potentially lead to more meaningful insights.